Modern Illumina Files
Each sample's files are in a directory named with the structure: Sample_[SAMPLENAME].
Many of the files will start with the file label: [SAMPLENAME]_[BARCODE]_L00[#]
Overview of notable files in sample directories
- [FILELABEL]_R[READNUM]_[FILENUMBER].fastq.gz
- These are gzipped FASTQ files. The read number can be 1 or 2 and denotes which read the tags come from; paired end flowcells will have both read numbers. The file number is three digits and starts counting from 001. All paired end flowcells will have the same number of FASTQ files for read 1 and read 2.
- [FILELABEL].sorted.bam
- This is a sorted BAM file containing all of the reads.
- [FILELABEL].uniques.sorted.bam
- This is a sorted BAM file containing only uniquely mapping reads passing filters.
- [FILELABEL].uniques.sorted.bam.bai
- This is the index to the uniques BAM file, allowing random access.
- [FILELABEL]_spot.txt
- This file contains a SPOT score--the percentage of uniquely mapping tags in hotspots.
- [FILELABEL]_spotdups.txt
- Contains the duplication metrics calculated by Picard on the same randomly selected set of tags used by the SPOT score.
- [FILELABEL]_uniques.bed.starch
- This is a compressed BED file of the tags in the uniques BAM file; it might not be present if you have not specifically requested BED input. It can be uncompressed using the freely available unstarch program in the bedops toolset.
- [FILELABEL]_75_20.[GENOME].bw
- This is a density file in the bigwig format, suitable for use in a UCSC browser. This is the currently generated density file format, as it takes up less space than .wig files.
- [FILELABEL]_75_20.[GENOME].wig
- This is a density file in the wig format, suitable for use in a UCSC browser; there will also be a matching .wib file. Older data may have this format.
- [FILELABEL].tagcounts.txt
- This file lists tag counts by category; the categories are defined under Tag count categories.
- [FILELABEL]_R[READNUM]_fastqc
- This is a directory produced by FastQC, a raw sequence quality control checker. You can read their help manual to get an idea of how to interpret the data. They have examples of a bad sequence report and a good sequence report.
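As an illustration of the naming scheme above, here is a minimal Python sketch that splits a FASTQ file name into its parts. The regular expression, the function name `parse_fastq_name`, and the sample file name are assumptions made for this example; in particular it assumes the barcode contains only the letters A, C, G, T or N and that the lane is a single digit.

```python
import re

# Assumed pattern: [SAMPLENAME]_[BARCODE]_L00[#]_R[READNUM]_[FILENUMBER].fastq.gz
# The barcode is assumed to contain only A, C, G, T or N; adjust if yours differ.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+)_L00(?P<lane>\d)"
    r"_R(?P<read>[12])_(?P<number>\d{3})\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return the parts of a FASTQ file name as a dict, or None if it does not match."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["lane"] = int(parts["lane"])
    parts["read"] = int(parts["read"])
    parts["number"] = int(parts["number"])
    return parts


# Hypothetical file name, for illustration only.
example = parse_fastq_name("MySample_ATCACG_L003_R1_001.fastq.gz")
```

Grouping the results by sample, lane and file number is one way to check that read 1 and read 2 have the same number of files.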
Overview of file formats
FASTQ format
Each entry in Illumina's FASTQ format spans four lines:
- The sequence identifier, starting with @
- The sequence
- A + indicating the quality score identifier line
- The quality score -- Sanger format (Phred+33)
The sequence identifier is in the following format:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>
| Element | Requirements | Description |
| --- | --- | --- |
| <instrument> | Characters allowed: a-z, A-Z, 0-9, underscore | Instrument ID |
| <run number> | Numerical | Run number on instrument |
| <flowcell ID> | Characters allowed: a-z, A-Z, 0-9 | The flowcell label; useful when writing in to us |
| <lane> | Numerical | Lane number |
| <tile> | Numerical | Tile number |
| <x-pos> | Numerical | X coordinate of cluster on the tile |
| <y-pos> | Numerical | Y coordinate of cluster on the tile |
| <read> | Numerical | Read number, either 1 or 2 (for paired end flowcells) |
| <is filtered> | Y or N | Y if the read is filtered, N otherwise. If the read is filtered, it should not be used. |
| <control number> | Numerical | 0 when none of the control bits are on, otherwise it is an even number |
| <index sequence> | ACTG | The barcode sequence; empty if this was the only sample on a lane. |
Here is Illumina's example read from the manual:
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@
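The identifier layout and quality encoding described above can be sketched in Python as follows. This is a minimal illustration, not part of any pipeline: the function names and dictionary keys are chosen for this example, and the parser assumes the identifier has exactly the eleven fields listed above.

```python
def parse_fastq_header(line):
    """Split an Illumina FASTQ sequence identifier into named fields."""
    if not line.startswith("@"):
        raise ValueError("sequence identifier must start with '@'")
    # The two halves of the identifier are separated by a single space.
    location, _, read_info = line[1:].strip().partition(" ")
    instrument, run, flowcell, lane, tile, x_pos, y_pos = location.split(":")
    read, is_filtered, control, index = read_info.split(":")
    return {
        "instrument": instrument,
        "run_number": int(run),
        "flowcell_id": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x_pos": int(x_pos),
        "y_pos": int(y_pos),
        "read": int(read),
        "is_filtered": is_filtered == "Y",  # filtered reads should not be used
        "control_number": int(control),
        "index_sequence": index,  # empty string when there is no barcode
    }


def phred33_scores(quality):
    """Convert a Sanger (Phred+33) quality string to integer scores."""
    return [ord(char) - 33 for char in quality]


header = parse_fastq_header("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG")
scores = phred33_scores("BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@")
```

A score of 33, for the character B, corresponds to an estimated error probability of 10^(-3.3) for that base call.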
BED format
Our BED files are currently created using bamToBed on our BAM files. Older BED files might have fewer columns, but the first three required columns will always be present. The columns are:
- chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
- chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
- chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
- name - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
- mapping quality - Taken from the BAM file; it represents −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. NOTE: UCSC's browser uses this column differently, as a display score.
- strand - Defines the strand - either '+' or '-'.
Here is an example of BED lines:
chr1 10245 10280 HWI-ST700693_247:4:2208:18413:38502/1 31 +
chr1 10316 10351 HWI-ST700693_247:4:2313:1426:80330/1 12 -
chr1 13070 13105 HWI-ST700693_247:4:1311:20921:99196/1 32 +
To save space, we have compressed our BED files using the starch program. They can be uncompressed using the freely available unstarch program in the bedops toolset. Our BED files have also been sorted using the sort-bed program, also in the bedops toolset.
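To make the column definitions concrete, the following Python sketch parses one uncompressed six-column BED line and converts the mapping quality back into a probability. The function names are made up for this example, and the parser assumes all six columns are present, which may not hold for older files.

```python
def parse_bed_line(line):
    """Parse one six-column BED line as described above."""
    chrom, start, end, name, mapq, strand = line.split()
    return {
        "chrom": chrom,
        "start": int(start),  # 0-based, first base of the feature
        "end": int(end),      # 0-based, not included in the feature
        "name": name,
        "mapq": int(mapq),
        "strand": strand,
    }


def mapping_error_probability(mapq):
    """Invert mapq = -10 * log10(p) to recover p, the chance the position is wrong."""
    return 10 ** (-mapq / 10)


tag = parse_bed_line("chr1 10245 10280 HWI-ST700693_247:4:2208:18413:38502/1 31 +")
tag_length = tag["end"] - tag["start"]  # half-open interval, so no +1 is needed
```

Because chromEnd is exclusive, the tag length is simply chromEnd minus chromStart. A mapping quality of 30 corresponds to a 1 in 1000 chance that the mapping position is wrong.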
BAM format
BAM is the compact, binary form of the SAM format. You can translate BAM files into SAM using samtools.
BAM files can be converted to BED files using bamToBed in the bedtools suite.
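As a rough sketch, the two conversions can be driven from Python by building the command lines and running them with `subprocess`. The file names below are hypothetical, and the helper names are invented for this example. The sketch assumes `samtools` and `bedtools` are installed and on your PATH; `bedtools bamtobed` is the subcommand form of the BAM to BED converter.

```python
import shutil
import subprocess


def bam_to_sam_command(bam_path, sam_path):
    """Build a samtools command that writes SAM (with header) from a BAM file."""
    return ["samtools", "view", "-h", "-o", sam_path, bam_path]


def bam_to_bed_command(bam_path):
    """Build a bedtools command that prints BED lines for a BAM file to stdout."""
    return ["bedtools", "bamtobed", "-i", bam_path]


def run_if_available(command):
    """Run the command only when its program is on PATH; return True if it ran."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


# Hypothetical file names, for illustration only.
sam_command = bam_to_sam_command("sample.uniques.sorted.bam", "sample.uniques.sorted.sam")
bed_command = bam_to_bed_command("sample.uniques.sorted.bam")
```

Calling `run_if_available(sam_command)` would perform the conversion; the BED command writes to standard output, so redirect or capture it as needed.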
Tag count categories
The tag count file is formatted with a count label and then a number on each line. Definitions for the labels are below:
| Label | Description |
| --- | --- |
| u | uniquely matching |
| u-pf | uniquely matching and passing filter |
| u-pf-n | uniquely matching, passing filter, no Ns |
| u-pf-n-mm2 | same as u-pf-n, but allows no more than 2 mismatches |
| u-pf-n-mm2-mito | same as u-pf-n-mm2, but also does not count matches to the mitochondrial chromosome |
| qc | no matching done, QC failure |
| nm | no match found |
| mm | multiple matches |
| pf | passes filter |
| total | total number of tags gathered |
There are also counts for tags aligned to the individual chromosomes.
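A tag count file of this shape can be read into a dictionary with a few lines of Python. This is only a sketch: it assumes the label and the number are separated by whitespace, and the counts in the example text are made up for illustration.

```python
def parse_tag_counts(text):
    """Parse 'label number' lines into a dict; assumes whitespace separates the two."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        label, value = fields
        counts[label] = int(value)
    return counts


# Made-up numbers, for illustration only.
example_text = """total 1000
pf 900
u 700
u-pf 650
nm 100
mm 200
"""

counts = parse_tag_counts(example_text)
unique_fraction = counts["u"] / counts["total"]
```

The per-chromosome counts would land in the same dictionary, keyed by whatever label the file uses for each chromosome.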
Illumina Export Files
The export format is deprecated in current CASAVA pipelines and is no longer available for new analyses, but older flowcells might still have export files as output.
The fields are as follows:
- Machine (Parsed from Run Folder name)
- Run Number (Parsed from Run Folder name)
- Lane
- Tile
- X Coordinate of cluster. As of RTA v1.6, OLB v1.6, and CASAVA v1.6, the X and Y coordinates for each cluster are calculated in a way that makes sure the combination will be unique. The new coordinates are the old coordinates times 10, +1000, and then rounded.
- Y Coordinate of cluster. As of RTA v1.6, OLB v1.6, and CASAVA v1.6, the X and Y coordinates for each cluster are calculated in a way that makes sure the combination will be unique. The new coordinates are the old coordinates times 10, +1000, and then rounded.
- Index sequence or 0. For no indexing, or for a file that has not been demultiplexed yet, this field should have a value of 0.
- Read number (1 for single reads; 1 or 2 for paired ends or multiplexed single reads; 1, 2, or 3 for multiplexed paired ends)
- Called sequence of read
- Quality string--In symbolic ASCII format (ASCII character code = quality value + 64)
- Match chromosome--Name of chromosome match OR code indicating why no match resulted (RM = repeat masked, for example match against abundant sequences, NM = not matched)
- Match Contig--Gives the contig name if there is a match and the match chromosome is split into contigs (Blank if no match found)
- Match Position--Always with respect to forward strand, numbering starts at 1 (Blank if no match found)
- Match Strand--"F" for forward, "R" for reverse (Blank if no match found)
- Match Descriptor--Concise description of alignment (Blank if no match found)
- A numeral denotes a run of matching bases
- A letter denotes substitution of a nucleotide: For a 35 base read, "35" denotes an exact match and "32C2" denotes substitution of a "C" at the 33rd position
- The escape sequence "^..$" represents an indel. An integer in the indel escape sequence (e.g. "10^2$18") indicates an insertion relative to reference of the specified size. A sequence in the indel escape sequence (e.g. "10^AG$20") indicates a deletion relative to reference, with the sequence given being the deleted reference sequence.
- Single-Read Alignment Score--Alignment score of a single-read match, or for a paired read, alignment score of a read if it were treated as a single read. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat. -1 for shadow reads.
- Paired-Read Alignment Score--Alignment score of a paired read and its partner, taken as a pair. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat. Note that in single-ended analyses it is always blank.
- Partner Chromosome--Name of the chromosome if the read is paired and its partner aligns to another chromosome
- Partner Contig
- Not blank if read is paired and its partner aligns to another chromosome and that partner is split into contigs.
- Blank for single-read analysis
- Partner Offset
- If a partner of a paired read aligns to the same chromosome and contig, this number, added to the Match Position, gives the alignment position of the partner.
- If partner is a shadow read, this value is 0.
- If partner aligns to a different chromosome and/or contig, the number represents the absolute position of the partner.
- Blank for single-read analysis unless the record belongs to a part of a spliced RNA read.
- Partner Strand--To which strand did the partner of the paired read align? "F" for forward, "R" for reverse ("N" if no match found, blank for single-read analysis)
- Filtering--Did the read pass filtering? 0 - No, 1 - Yes.
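Two of the fields above, the match descriptor and the quality string, have compact encodings that are easier to follow with an example. The Python sketch below is one possible reading of the rules listed for those fields; the function names and the operation labels ("match", "substitution", "insertion", "deletion") are invented here, and the parser assumes bases are written as A, C, G, T or N.

```python
import re

# One token is a run of matching bases, a single substituted base,
# or an indel escape sequence of the form ^...$
TOKEN = re.compile(r"(?P<run>\d+)|(?P<sub>[ACGTN])|\^(?P<indel>\d+|[ACGTN]+)\$")


def parse_match_descriptor(descriptor):
    """Turn a match descriptor into a list of (kind, value) operations."""
    operations = []
    position = 0
    while position < len(descriptor):
        match = TOKEN.match(descriptor, position)
        if match is None:
            raise ValueError(f"unexpected text at offset {position}: {descriptor!r}")
        if match.group("run") is not None:
            operations.append(("match", int(match.group("run"))))
        elif match.group("sub") is not None:
            operations.append(("substitution", match.group("sub")))
        elif match.group("indel").isdigit():
            # An integer inside ^...$ is an insertion of that many bases.
            operations.append(("insertion", int(match.group("indel"))))
        else:
            # A sequence inside ^...$ is the deleted reference sequence.
            operations.append(("deletion", match.group("indel")))
        position = match.end()
    return operations


def phred64_scores(quality):
    """Decode an export-format quality string (character code = quality + 64)."""
    return [ord(char) - 64 for char in quality]


substitution = parse_match_descriptor("32C2")
insertion = parse_match_descriptor("10^2$18")
deletion = parse_match_descriptor("10^AG$20")
```

For "32C2", the 32 matching bases, the one substituted base and the 2 trailing matches add up to the 35 bases of the read.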
Wig and Bigwig files
Density files can be in the wig or bigwig formats. We create our density files with a window of +/-75 basepairs once every 20 positions.
You can read more about the wig format here.
You can read more about the bigwig format here. That page also includes information on how to make bigwig files from wig files, and on extracting information from bigwig files.
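The windowing scheme behind the density files (a window of +/-75 basepairs, evaluated once every 20 positions) can be sketched in Python as below. This is not the code used to generate our files; it is an illustration under several assumptions: each tag is represented by a single position, the window bounds are inclusive, and the first window center is the `start` argument. The function name and the tag positions are made up.

```python
from bisect import bisect_left, bisect_right


def windowed_density(positions, start, end, half_window=75, step=20):
    """Count tags within +/- half_window of each center, stepping by `step`."""
    ordered = sorted(positions)
    densities = []
    for center in range(start, end, step):
        # Binary search finds the tags that fall inside the inclusive window.
        low = bisect_left(ordered, center - half_window)
        high = bisect_right(ordered, center + half_window)
        densities.append((center, high - low))
    return densities


# Made-up tag positions on one chromosome, for illustration only.
tags = [100, 110, 180, 400]
density = windowed_density(tags, start=100, end=200)
```

Since the windows are 151 positions wide but the centers are only 20 positions apart, neighboring values overlap heavily, which is what gives the density track its smooth appearance.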
Old Illumina System Files
Older flowcells (a year old or more) may have different files and a different file structure than those documented above.
Most files will start with the format: s_[LANE]
Old multiplex flowcells will also have their files separated into bin folders of three digits, such as: 001 or 004.
- s_[LANE]_sequence.txt.gz
- A gzipped FASTQ file--similar to the FASTQ format described above, but the identification tag is different, the quality score indicator line contains the identification tag, and the encoding for the quality score will depend on which version of the software was used for alignment.
- s_[LANE]_export.txt.gz
- These are alignment files in Illumina's deprecated export format.
- uniques.lane1.[GENOME].bed.gz
- These are gzipped BED files--similar to the BED format described above, but instead of a name column, the tag is included.